Abstract

Word-level alignments of bilingual text (bitexts) are not only an integral
part of statistical machine translation models, but also useful for lexical
acquisition, treebank construction, and part-of-speech tagging. The frequent
occurrence of divergences, structural differences between languages, presents
a great challenge to the alignment task. We resolve some of the most prevalent
divergence cases by using syntactic parse information to transform the
sentence structure of one language to bear a closer resemblance to that of the
other language. In this paper, we show that common divergence types can be
found in multiple language pairs (in particular, we focus on English-Spanish
and English-Arabic) and systematically identified. We describe our techniques
for modifying English parse trees to form resulting sentences that share more
similarity with the sentences in the other languages; finally, we present an
empirical analysis comparing the complexities of performing word-level
alignments with and without divergence handling. Our results suggest that
divergence handling can improve word-level alignment.

1  Introduction 

Word-level alignments of bilingual text (bitexts) are not only an integral
part of statistical machine translation models, but also useful for lexical
acquisition, treebank construction, and part-of-speech tagging (Yarowsky and
Ngai, 2001). The frequent occurrence of "divergences", structural differences
between languages, presents a great challenge to the alignment task. In this
paper, we show that common divergence types can be found in multiple language
pairs and systematically identified. We focus on English-Spanish and
English-Arabic, presenting techniques for modifying English parse trees to
form resulting sentences that share more similarity with the sentences in the
other languages. We resolve some of the most prevalent divergence cases by
using syntactic parse information to transform the sentence structure of one
language to bear a closer resemblance to that of the other language.

The following three ideas motivate the development of automatic "divergence
correction" techniques:

1. Every language pair has translation divergences that are easy to recognize.

2. Knowing what they are and how to accommodate them provides the basis for
refined word-level alignment.

3. Improved word-level alignment results in improved projection of structural
information from English to the foreign language.

This paper elaborates primarily on points 1 and 2 above, but our ultimate goal
is to set these in the context of 3, i.e., for training foreign-language
parsers to be used in statistical machine translation.

A divergence occurs when the underlying concepts or gist of a sentence are
distributed over different words in different languages. For example, the
notion of floating across a river is expressed as float across a river in
English and cross a river floating (atravesó el río flotando) in Spanish
(Dorr, 1993), and similarly in Arabic.
While seemingly transparent for human readers, this throws statistical
aligners for a serious loop. Far from being a rare occurrence, our preliminary
investigations revealed that divergences occurred in approximately 1 out of
every 3 sentences in a sample of 19K sentences from the TREC El Norte
Newspaper Corpus^1 (using automatic detection techniques followed by human
confirmation). Thus, finding a way to deal effectively with these divergences
and repair them would be a massive advance for bilingual alignment.

The current avenue of research involves transforming English into a
pseudo-English form (which we call E') that more closely matches the physical
form of the foreign language, e.g., "float across a river" becomes "cross a
river floating" if the foreign language is Spanish. Ideally, this rewriting of
the English sentence creates more one-to-one correspondences, which in turn
facilitate our statistical alignment process. The key is to identify possible
rewritings for known examples (our training set), which then generalize to
rewritings for examples not yet covered in our training set. In theory, our
rewriting approach applies to all divergence types. Thus, given a corpus,
divergences are identified, rewritten, and then run through the statistical
aligner of choice.

The alignment process is enhanced by the injection of linguistic knowledge
into the standard parameter tables used in statistical language modeling: the
distortion table (d), for reordering the words of the English sentence; the
translation table (t), for translating the words of the English sentence; and
the insertion table (n), for inserting words into the English sentence.
Although statistical alignment relies on the values of the parameters in these
three tables, the process need not have any deeper knowledge beyond these
values. For example, the head-swapping divergence above based on the notion of
"float across" would involve a reordering of the English words, as dictated by
the values in the d-table, prior to alignment with the Spanish sentence
containing "atravesó flotando".

^1 This analysis was done on the TREC Spanish Data, LDC catalog no.
LDC2000T51, ISBN 1-58563-177-9, 2000.

The next section sets this work in the context of related work on alignment
and projection of structural information between languages. Section 3
describes the range of divergence types covered in this work (with examples in
Spanish and Arabic). Section 4 reports how often these divergences occur in
large corpora. Section 5 describes an experiment we undertook to examine the
benefits of injecting linguistic knowledge into the alignment process. We
present an empirical analysis comparing the complexities of performing
word-level alignments with and without divergence handling. We conclude that
annotators agree with each other more consistently when performing word-level
alignments on bitext with divergence handling.

2  Related  Work 

Recently, researchers have extended traditional statistical machine
translation (MT) models (Brown et al., 1990; Brown et al., 1993) to include
the syntactic structures of the languages (Alshawi et al., 2000; Alshawi and
Douglas, 2000; Wu, 1997). Furthermore, Yamada and Knight (2001) have shown
that MT models are significantly improved when trained on syntactically
annotated data. The cost of human labor in producing annotated treebanks is
often prohibitive, thus making the construction of such data for new languages
infeasible. Some researchers have developed techniques for fast acquisition of
hand-annotated treebanks (Fellbaum et al., 2001). Others have appealed to
machine learning techniques. For example, the work of Hermjakob and Mooney
(1997) and Hwa (2000) aimed to minimize the amount of annotated data needed to
induce a parser.

In a cross-language setting, some researchers have taken approaches that
circumvent the construction of annotated treebanks, producing MT systems
without parsers and dictionaries. One example is the Expedition effort, an
enterprise to build machine translation capability for low-density languages
in short periods of time (Amtrup et al., 1999). Others have proposed to induce
a parser from a noisy treebank of foreign-language dependency trees that are
automatically projected from English (Hwa et al., 2002). Our approach is most
relevant to the latter in that it provides a significantly noise-reduced
foreign-language dependency treebank for inducing a foreign-language parser.
We repair misalignments resulting from cross-language divergences, bringing
about a more accurate dependency-tree projection from English into the foreign
language.

3  Background:  Divergences 

As stated at the outset, every language pair has divergences that are easy to
recognize. Our empirical work on English-Spanish and English-Arabic
translation has revealed that there are six divergences of interest. Each one
is described, in turn, below.

3.1  Light  Verb  Construction 

A light verb construction involves a single verb in one language being
translated using a combination of a semantically "light" verb, i.e., one that
carries little or no specific meaning in its own right, and some other meaning
unit (perhaps a noun) to convey the appropriate meaning. English light verbs
include give, make, do, take, and have. Our findings indicate that Spanish
tends to be more verbose (employing the Light Verb Construction) than English;
by contrast, Arabic tends to be more contracted, mapping to a Light Verb
Construction on the English side:

(1a) English-Spanish:

to kick ⇔ dar una patada (give a kick)
to end ⇔ poner fin (put end)
to note ⇔ tomar nota (take note)

(1b) English-Arabic:

to do well ⇔ [Arabic] (do-good)
to be not ⇔ [Arabic] (be-not)
to make do ⇔ [Arabic] (make-do)

3.2  Manner  Conflation 

This divergence involves translating a single manner verb (e.g., float) as a
light verb of motion and a manner-indicating content word. In Spanish, the
content word is typically a progressive manner verb, whereas Arabic generally
involves the translation of a verb into an English verb of motion and an
adverbial describing an aspect of the motion (such as direction).

(2a) English-Spanish:

to float ⇔ ir flotando (go (via) floating)
to pass ⇔ ir pasando (go passing)

(2b) English-Arabic:

to take out ⇔ [Arabic] (take-out)
to come again ⇔ [Arabic] (return)
to go west ⇔ [Arabic] (go-west)

3.3  Head  Swapping 

This divergence involves the demotion of the head verb and the promotion of
one of its modifiers to head position. In other words, a permutation of
semantically equivalent words is necessary to go from one language to the
other. In Spanish, this divergence is typical in the translation of an English
motion verb and a preposition as a directed motion verb and a progressive
verb. This divergence is less common in the case of English-Arabic. Examples
are given below.

(3a) English-Spanish:

to run in ⇔ entrar corriendo (enter running)
to fly about ⇔ andar volando (go-about flying)

(3b) English-Arabic:

to laugh the night away ⇔ [Arabic] (pass-away the-night laughing)
to do something quickly ⇔ [Arabic] (go-quickly in doing something)

3.4  Thematic  Divergence 

A thematic divergence occurs when the verb's arguments switch thematic roles
from one language to another. The Spanish verbs gustar and doler are examples
of this case. This type of divergence is very common in Spanish and is, in
fact, the most abundant divergence type in the TREC El Norte Corpus. Although
thematic divergences arise in the English-Arabic case as well, they are less
common. Consider the cases below.

(4a) English-Spanish:

I like grapes ⇔ me gustan uvas (to-me please grapes)
I have a headache ⇔ me duele la cabeza (to-me hurt the head)

(4b) English-Arabic:

I like grapes ⇔ [Arabic] (grapes please-me)
I have a headache ⇔ [Arabic] (my-head hurt-me)

3.5  Categorial  Divergence 

A categorial divergence involves a translation that uses different parts of
speech. In the English-Spanish example below, the adjectival phrase is
translated into a light verb accompanied by a nominal version of the
adjective. A common form of this divergence between English and Arabic is the
nominalization of the English verb. The examples below illustrate this
divergence.

(5a) English-Spanish:

to be jealous ⇔ tener celos (to have jealousy)
to be fully aware ⇔ tener plena conciencia (have full awareness)

(5b) English-Arabic:

when he returns ⇔ [Arabic] (upon return-his)

3.6  Structural  Divergence 

A structural divergence involves the realization of incorporated arguments
such as subject and object as obliques (i.e., headed by a preposition in a
PP). The following are examples in English-Spanish and English-Arabic:

(6a) English-Spanish:

to enter the house ⇔ entrar en la casa (enter in the house)
ask for a referendum ⇔ pedir un referendum (ask-for a referendum)

(6b) English-Arabic:

to seek ⇔ [Arabic] (search for)
God commanded it ⇔ [Arabic] (commanded with-it God)

4  Occurrence of Divergences in Large Corpora

In theory, the divergences illustrated in the previous section are common to
every language. However, they may be realized in different ways in different
language pairs. As we have seen, Spanish may be analyzed as a rewriting of
English that involves "expansion," i.e., the Spanish appears to be more
verbose. On the other hand, the same divergence types showed up differently in
Arabic; the rewritings appear to be a "contraction" of the English, rather
than an "expansion," i.e., the English appears to be more verbose.

We investigated divergences in Arabic and Spanish corpora to determine how
often such cases arise.^2 First, we developed a set of hand-crafted regular
expressions for detecting divergent sentences in Arabic and Spanish corpora.
The Arabic regular expressions were derived by examining a small set of
sentences (50), a process which took approximately 20 person-hours. The
Spanish expressions were derived by a different process, involving a more
general analysis of the behavior of the language, which took approximately
2 person-months.
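To make the idea of regular-expression detection concrete, here is a minimal sketch of what such a detector might look like for Spanish. The patterns, the category names, and the `detect_divergences` helper are all hypothetical illustrations built from the examples in Section 3; they are not the expressions used in this study, which are not reproduced here.

```python
import re

# Hypothetical patterns (assumptions for illustration only): each maps a
# divergence category to a regular expression over lowercased Spanish text.
PATTERNS = {
    # light verb + optional article + noun, e.g. "dar una patada", "tiene miedo"
    "light_verb": re.compile(
        r"\b(?:dar|dio|poner|puso|tomar|tener|tiene|tengo)\s+"
        r"(?:una?\s+)?(?:patada|fin|nota|miedo|celos)\b"
    ),
    # motion verb + progressive form, e.g. "ir flotando", "anda corriendo"
    "manner": re.compile(r"\b(?:ir|va|iba|andar|anda)\s+\w+[ae]ndo\b"),
    # dative clitic + gustar/doler, e.g. "me gustan", "le duele"
    "thematic": re.compile(r"\b(?:me|te|le|nos|les)\s+(?:gusta|gustan|duele|duelen)\b"),
}


def detect_divergences(sentence):
    """Return the sorted list of categories whose pattern matches the sentence."""
    text = sentence.lower()
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


if __name__ == "__main__":
    for s in ["Juan tiene miedo", "Me duele la cabeza", "La casa es grande"]:
        print(s, "->", detect_divergences(s))
```

Tight patterns such as these favor precision over recall, which is the trade-off discussed for the two sets of Spanish expressions in this section. Every automatic hit would still need human confirmation.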

We applied the Spanish and Arabic regular expressions to a sample of 19K TREC
sentences and 1K sentences from the Arabic Bible. Each automatically detected
divergence was subsequently verified by a human and categorized into a
particular divergence category. Table 1 indicates the percentage of cases we
detected automatically and also the percentage of cases that were confirmed
(by humans) to be actual cases of divergence.

It is important to note that these numbers reflect the techniques used to
calculate them. Because the Spanish regular expressions were derived through a
more general analysis of the language, the precision is higher in Spanish than
it is in Arabic. Human inspection confirmed approximately 1995 Spanish
sentences out of the 2109 that were automatically detected (95% precision),
whereas 124 sentences were confirmed out of the 319 detected Arabic
divergences (39% precision).
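The percentages in this section follow directly from the reported counts. The short sketch below recomputes them; it assumes the sample sizes are exactly 19,000 and 1,000 sentences (the text gives them only as 19K and 1K), and the `rate` helper is a name introduced here for illustration.

```python
def rate(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts reported in the text: (sample size, detected, human-confirmed).
# The exact sample sizes are an assumption; the paper states 19K and 1K.
counts = {
    "Spanish": (19000, 2109, 1995),
    "Arabic": (1000, 319, 124),
}

for language, (sample, detected, confirmed) in counts.items():
    print(
        language,
        "detected %:", rate(detected, sample),
        "confirmed %:", rate(confirmed, sample),
        "precision %:", rate(confirmed, detected),
    )
```

Under these assumptions the detected and confirmed rates agree with Table 1, and the precision values round to the 95% and 39% quoted above.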

On the other hand, the more constrained Spanish expressions appear to give
rise to a lower recall. In fact, an independent study with more relaxed
regular expressions on the same 19K Spanish sentences resulted in the
automatic detection of divergences in 18K sentences (95% of the corpus), 6.8K
of which were confirmed by humans to be correct (35% of the corpus). Future
work will involve repeated constraint adjustments on the regular expressions
to determine the best balance between precision and recall for divergence
detection. We believe the Arabic expressions fall somewhere in between the two
sets of Spanish expressions, which are conjectured to be at the two extremes
of constraint relaxation: very tight in the case above and very loose in our
independent study.

^2 For Spanish, we used TREC Spanish Data; for Arabic, we used an electronic
version of the Bible written in Modern Standard Arabic (28K verses).

Language   Detected      Human       Sample Size   Corpus Size
           Divergences   Confirmed   (sentences)   (sentences)
Spanish    11.1%         10.5%       19K           150K
Arabic     31.9%         12.4%       1K            28K

Table 1: Divergence Statistics

5  Experiment: Impact of Divergence Correction on Alignment

To evaluate our hypothesis that transformations of divergent cases can
facilitate the word-level alignment process, we have conducted human alignment
studies for two different pairs of languages: English-Spanish and
English-Arabic. We have chosen these two pairings to test the generality of
the divergence transformation principle.

Our experiment involves four steps:

i. Identify 6 canonical rewritten structures, one for each divergence
category.

ii. Automatically categorize English sentences into one of the 6 divergence
categories (or "none") based on the foreign language.

iii. Apply the appropriate canonical rewriting to each divergence-categorized
English sentence, renaming it E'.

iv. For each language:

- Human-align the true English sentence and the foreign-language sentence.

- Human-align the rewritten E' sentence and the foreign-language sentence.

- Compare inter-annotator agreement between the first and second sets.

First, we describe the structural representation used in the canonical
rewritten structures of steps i-iii above. Then we describe our experimental
setup for step iv.


5.1  Structural Representation used in our Approach

The structures used in our approach are modeled after the dependency-tree
representations used in the Minipar system (Lin, 1995; Lin, 1998). We
accommodate the divergence categories above by rewriting the dependency tree
so that it is parallel to what would be the equivalent Spanish dependency
tree. For example, consider the sentence John kicked Mary. Our approach
rewrites the dependency tree for this sentence as a new dependency tree
corresponding to the sentence John gave kicks to Mary.

The transformation between these two dependency tree representations is
depicted in a simplified format below:

(6)(i) Minipar Dependency Tree:

(kick root (John N subj) (mary N obj))

(6)(ii) Rewritten Minipar Dependency Tree:

(give root (John N subj)
  (kicks N obj) (to P mod (mary N pcomp-n)))
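The rewriting in example (6) can be sketched in a few lines of code. The version below is only an illustration under stated assumptions: the `Node` class, the `LIGHT_VERB_LEXICON` table, and both function names are hypothetical, the category labels are omitted for brevity, and the sketch handles just the light verb case rather than all six canonical rewritings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    word: str
    rel: str  # dependency relation to the parent ("root" at the top)
    children: List["Node"] = field(default_factory=list)


# Hypothetical lexicon (an assumption): English verb -> (light verb, nominal).
LIGHT_VERB_LEXICON = {"kick": ("give", "kicks")}


def rewrite_light_verb(tree):
    """Rewrite (verb subj obj) as (light-verb subj nominal (to obj))."""
    if tree.word not in LIGHT_VERB_LEXICON:
        return tree  # no rewriting rule applies; leave the tree unchanged
    light, nominal = LIGHT_VERB_LEXICON[tree.word]
    subjects = [c for c in tree.children if c.rel == "subj"]
    objects = [c for c in tree.children if c.rel == "obj"]
    others = [c for c in tree.children if c.rel not in ("subj", "obj")]
    # Each original object becomes the complement of the preposition "to".
    obliques = [
        Node("to", "mod", [Node(o.word, "pcomp-n", o.children)]) for o in objects
    ]
    return Node(light, tree.rel, subjects + [Node(nominal, "obj")] + obliques + others)


def to_sexpr(node):
    """Render a tree in a bracketed style similar to example (6)."""
    inner = "".join(" " + to_sexpr(c) for c in node.children)
    return f"({node.word} {node.rel}{inner})"


if __name__ == "__main__":
    tree = Node("kick", "root", [Node("John", "subj"), Node("Mary", "obj")])
    print(to_sexpr(tree))
    print(to_sexpr(rewrite_light_verb(tree)))
```

Reading the words of the rewritten tree in order gives "give John kicks to Mary" up to inflection and word order, which is the E' counterpart of John gave kicks to Mary.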

Table 2 shows examples of the types of sentences that were aligned with the
foreign language in our experiment (including both English and E').

5.2  Experimental  Setup 

In each experiment, four fluently bilingual human subjects were asked to
perform word-level alignments on the same set of sentences selected from the
Bible. They were all provided the same instructions and software, similar to
the methodology and system described by Melamed (1998). Two of the four
subjects were given the original English and foreign language sentences; they
served as the control for the experiment. The sentences given to the other two
consisted of the original foreign language sentences paired with altered
English (denoted as E') resulting from the divergence transformations
described above. We compare the inter-annotator agreement rates and other
relevant statistics between the two sets of human subjects. If the divergence
transformations had successfully modified English structures to match those of
the foreign language, we would expect the inter-annotator agreement rate
between the subjects aligning the E' set to be higher than that of the control
set. We would also expect the E' set to have fewer unaligned and
multiply-aligned words.

Type             English                             E'                            Foreign Equivalent
Light Verb       fear                                have fear                     tiene miedo
                 try                                 put to trying                 poner a prueba
                 make any cuttings                   wound                         [Arabic]
                 our hand is high                    our hand heightened           [Arabic]
                 he is not here                      he be-not here                [Arabic]
Manner           teaches                             walks teaching                anda enseñando
                 is spent                            self goes spending            se va gastando
                 he sent his brothers away           he dismissed his brothers     [Arabic]
                 spake good of                       speak-good about              [Arabic]
                 he turned again                     he returned                   [Arabic]
Head Swapping^3  walked out                          move-out walking              salió caminando
Thematic         I am pained                         me pain they                  me duelen
                 He loves it                         to him be-loved it            le gusta
                 it was on him                       was he-wears-it               [Arabic]
Categorial       I am jealous                        I have jealousy               tengo celos
                 I require of you                    I require-of you              te pido
                 (he) shall estimate                 according-to (his)-estimate   [Arabic]
                 (how long shall) the land mourn     (stays) the-land mourning     [Arabic]
                 he went to                          his-return to                 [Arabic]
Structural       after six years                     after of six years            después de seis años
                 and because of that in other parts  and for that in other parts   y por ello en otras partes
                 I forsake thee                      I-forsake about-you           [Arabic]
                 we found water                      we-found on-water             [Arabic]

Table 2: Examples of True English, E', and Foreign Equivalent

5.2.1  Experiment 1: English and Spanish

For the first experiment, the subjects were presented with 150 English-Spanish
sentence pairs from the English and Spanish Bibles. The sentence selection
procedure is similar to the divergence detection process described in the
previous section. These sentences were first selected as potential
divergences, using the hand-crafted regular expressions referred to in
Section 4; they were subsequently verified by the experimenter as belonging to
a particular divergence type. The average length of the English sentences is
25.6 words; the average length of the Spanish sentences is 24.7 words. Of the
four human subjects, two are native Spanish speakers, and two are Spanish
literature concentrators. The backgrounds of the four human subjects are
summarized in Table 3.

5.2.2  Experiment 2: English and Arabic

In the second experiment, the subjects were presented with 50 English-Arabic
sentence pairs selected from the English and Arabic Bibles. While the total
number of sentences is smaller than in the previous experiment, many sentences
contain multiple divergences. The average English sentence length is 30.5
words, and the average Arabic sentence length is 17.4 words. The backgrounds
of the four human subjects are summarized in Table 4.

Inter-annotator agreement rate is quantified for each pair of subjects who
viewed the same set of data. We hold one subject's alignments as the "ideal"
and compute the precision and recall figures for the other subject based on
how many alignment links were made by both people. The averaged precision and
recall figures (F-scores)^4 for the two experiments and other relevant
statistics are summarized in Table 5. In both experiments, the inter-annotator
agreement is higher for the bitext in which the divergent portions of the
English sentences have been transformed. For the English-Spanish experiment,
the agreement rate increased from 79.3% to 82.1%, resulting in an error
reduction of 13.5%; for the English-Arabic experiment, the agreement rate
increased from 69.5% to 72.5%, an error reduction of 9.8%.
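As a worked illustration of these figures, the sketch below computes the agreement F-score between two annotators' link sets and the relative error reduction between two agreement rates. It assumes that alignment links are represented as (English index, foreign index) pairs and that "error" means 100% minus the agreement rate; both function names are introduced here for illustration.

```python
def agreement_f_score(reference_links, other_links):
    """F-score of one annotator's links against another's held as reference."""
    reference, other = set(reference_links), set(other_links)
    shared = len(reference & other)  # links made by both annotators
    if shared == 0:
        return 0.0
    precision = shared / len(other)
    recall = shared / len(reference)
    return 2 * precision * recall / (precision + recall)


def error_reduction(before, after):
    """Relative reduction in disagreement; agreement rates are in percent."""
    return 100.0 * (after - before) / (100.0 - before)


if __name__ == "__main__":
    annotator_a = {(0, 0), (1, 1), (2, 2)}
    annotator_b = {(0, 0), (1, 1), (2, 3), (3, 3)}
    print(round(agreement_f_score(annotator_a, annotator_b), 3))
    print(round(error_reduction(79.3, 82.1), 1))  # English-Spanish
    print(round(error_reduction(69.5, 72.5), 1))  # English-Arabic
```

Because the F-score is symmetric in precision and recall, the result does not depend on which of the two annotators is held as the "ideal".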

^3 Although cases of Head Swapping arise in Arabic (as shown in Section 3.3),
we did not find any such cases in the small sample of sentences that we human
checked in the Arabic Bible.

^4 $F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
            data set     native-tongue   linguistic knowledge?   ease with computers
Subject 1   control      Spanish         yes                     high
Subject 2   control      Spanish         no                      low
Subject 3   divergence   English         no                      high
Subject 4   divergence   English         no                      low

Table 3: A summary of the backgrounds of the English-Spanish subjects


            data set     native-tongue   linguistic knowledge?   ease with computers
Subject 1   control      Arabic          yes                     high
Subject 2   control      Arabic          no                      high
Subject 3   divergence   Arabic          no                      high
Subject 4   divergence   Arabic          no                      high

Table 4: A summary of the backgrounds of the English-Arabic subjects


Additional statistics also support our hypothesis that transforming divergent
English sentences facilitates word-level alignment by reducing the number of
unaligned and multiply-aligned words. In the English-Spanish experiment, the
numbers of both unaligned words and multiply-aligned words decreased when
aligning to the modified English sentences. The percentage of unaligned words
decreased from 17% to 14%, and the average number of links to a word fell from
1.35 to 1.16.^5 In the English-Arabic experiment, the number of unaligned
words is significantly smaller when aligning Arabic sentences to the modified
English sentences; however, on average multiple-alignment increased. This may
be due to the big difference in sentence lengths (English sentences are
typically twice as long as the Arabic ones); thus it is not surprising that
the average number of alignments per word would be closer to two when most of
the words are aligned. The reason for the lower number in the unmodified
English case might be that the subjects only aligned words that had clear
translations.
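The two statistics discussed here can be computed from a set of alignment links as in the sketch below. It rests on assumptions the text does not settle: links are (English index, foreign index) pairs, words on both sides of the bitext are counted, and, as in the note to Table 5, unaligned words are left out of the average. The `alignment_stats` name is hypothetical.

```python
def alignment_stats(num_english, num_foreign, links):
    """Return (% of unaligned words, average links per aligned word).

    `links` holds (english_index, foreign_index) pairs. Words on both sides
    are counted (an assumption); unaligned words are excluded from the average.
    """
    english_counts = [0] * num_english
    foreign_counts = [0] * num_foreign
    for e, f in set(links):
        english_counts[e] += 1
        foreign_counts[f] += 1
    counts = english_counts + foreign_counts
    if not counts:
        return 0.0, 0.0
    aligned = [c for c in counts if c > 0]
    unaligned_pct = 100.0 * (len(counts) - len(aligned)) / len(counts)
    avg_links = sum(aligned) / len(aligned) if aligned else 0.0
    return unaligned_pct, avg_links


if __name__ == "__main__":
    # Four English words, three foreign words; foreign word 1 has two links.
    pct, avg = alignment_stats(4, 3, {(0, 0), (1, 1), (2, 1)})
    print(round(pct, 1), round(avg, 2))
```

When one side is much shorter than the other, as with the Arabic sentences, aligning more of the longer side necessarily pushes more links onto each word of the shorter side, which is consistent with the explanation offered above.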

^5 The relatively high overall percentage of unaligned words is due to the
fact that the subjects did not align punctuation.

6  Conclusion and Future Work

In this paper, we have described six divergence types that frequently occur in
many language pairs, such as English-Spanish and English-Arabic. By examining
bitext corpora, we have established conservative lower bounds, estimating that
these divergences occur at least 10% of the time. A realistic sampling
indicates that the percentage is actually significantly higher, approximately
35% in Spanish.

We have shown that divergence cases can be systematically handled by
transforming the syntactic structures of the English sentences to bear a
closer resemblance to those of the foreign language according to a small set
of templates. The validity of the divergence handling has been verified
through two word-level alignment experiments. In both cases, the human
subjects consistently had a higher agreement rate with each other on the task
of performing word-level alignment when divergent English phrases were
transformed. This result suggests that divergence handling will significantly
improve automatic methods of word-level alignment and facilitate
cross-language processing research such as creating foreign language treebanks
from projected English syntactic structures.
        Inter-annotator agreement   % of unaligned   Avg. alignments
        (F-score)                   words            per word
E-S     79.3                        17.2             1.35
E'-S    82.1                        14.0             1.16
E-A     69.5                        38.5             1.48
E'-A    72.5                        11.9             1.72

Table 5: The results of the two experiments. Note: in computing the average
number of alignments per word, we do not include unaligned words.